
    A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-$\ell_1$-Norm Interpolated Classifiers

    This paper establishes a precise high-dimensional asymptotic theory for boosting on separable data, taking statistical and computational perspectives. We consider a high-dimensional setting where the number of features (weak learners) $p$ scales with the sample size $n$, in an overparametrized regime. Under a class of statistical models, we provide an exact analysis of the generalization error of boosting when the algorithm interpolates the training data and maximizes the empirical $\ell_1$-margin. Further, we explicitly pin down the relation between the boosting test error and the optimal Bayes error, as well as the proportion of active features at interpolation (with zero initialization). In turn, these precise characterizations answer certain questions raised in Breiman (1999) and Schapire et al. (1998) surrounding boosting, under assumed data generating processes. At the heart of our theory lies an in-depth study of the maximum-$\ell_1$-margin, which can be accurately described by a new system of non-linear equations; to analyze this margin, we rely on Gaussian comparison techniques and develop a novel uniform deviation argument. Our statistical and computational arguments can handle (1) any finite-rank spiked covariance model for the feature distribution and (2) variants of boosting corresponding to general $\ell_q$-geometry, $q \in [1, 2]$. As a final component, via the Lindeberg principle, we establish a universality result showcasing that the scaled $\ell_1$-margin (asymptotically) remains the same, whether the covariates used for boosting arise from a non-linear random feature model or an appropriately linearized model with matching moments.
    Comment: 68 pages, 4 figures
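    The quantity at the heart of this analysis is the maximum empirical $\ell_1$-margin, $\max_{\|\theta\|_1 \le 1} \min_i y_i \langle x_i, \theta \rangle$. As a rough illustration only (not the paper's algorithm or code), this margin can be computed on separable data by a small linear program; the sketch below uses scipy and illustrative variable names.

```python
# Minimal sketch: maximum l1-margin of separable data (X, y),
#   max_{||theta||_1 <= 1} min_i y_i <x_i, theta>,
# posed as a linear program. Illustrative only; not the paper's code.
import numpy as np
from scipy.optimize import linprog

def max_l1_margin(X, y):
    n, p = X.shape
    # variables z = [theta (p), u (p), t (1)]; u bounds |theta| coordinatewise
    c = np.zeros(2 * p + 1)
    c[-1] = -1.0                                   # maximize t <=> minimize -t

    rows, rhs = [], []
    for i in range(n):                             # t - y_i <x_i, theta> <= 0
        r = np.zeros(2 * p + 1)
        r[:p] = -y[i] * X[i]
        r[-1] = 1.0
        rows.append(r); rhs.append(0.0)
    for sign in (+1.0, -1.0):                      # +/- theta_j <= u_j
        block = np.hstack([sign * np.eye(p), -np.eye(p), np.zeros((p, 1))])
        rows.extend(block); rhs.extend([0.0] * p)
    r = np.zeros(2 * p + 1); r[p:2 * p] = 1.0      # sum(u) <= 1, i.e. ||theta||_1 <= 1
    rows.append(r); rhs.append(1.0)

    bounds = [(None, None)] * p + [(0, None)] * p + [(None, None)]
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    return -res.fun, res.x[:p]                     # (margin, direction theta)
```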

    Abstracting Fairness: Oracles, Metrics, and Interpretability

    It is well understood that classification algorithms, for example, for deciding on loan applications, cannot be evaluated for fairness without taking context into account. We examine what can be learned from a fairness oracle equipped with an underlying understanding of "true" fairness. The oracle takes as input a (context, classifier) pair satisfying an arbitrary fairness definition, and accepts or rejects the pair according to whether the classifier satisfies the underlying fairness truth. Our principal conceptual result is an extraction procedure that learns the underlying truth; moreover, the procedure can learn an approximation to this truth given access to a weak form of the oracle. Since every "truly fair" classifier induces a coarse metric, in which those receiving the same decision are at distance zero from one another and those receiving different decisions are at distance one, this extraction process provides the basis for ensuring a rough form of metric fairness, also known as individual fairness. Our principal technical result is a higher-fidelity extractor under a mild technical constraint on the weak oracle's conception of fairness. Our framework permits the scenario in which many classifiers, with differing outcomes, may all be considered fair. Our results have implications for interpretability -- a highly desired but poorly defined property of classification systems that endeavors to permit a human arbiter to reject classifiers deemed to be "unfair" or illegitimately derived.
    Comment: 17 pages, 1 figure
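    As a concrete reading of the coarse metric mentioned above, the sketch below (illustrative names, not from the paper) constructs the {0, 1}-valued distance induced by a classifier's decisions: points receiving the same decision are at distance zero, points receiving different decisions are at distance one.

```python
# Minimal sketch of the coarse metric induced by a classifier h:
# d(x, z) = 0 if h assigns x and z the same decision, else 1.
# Illustrative only; names are not from the paper.
from typing import Any, Callable

def coarse_metric(h: Callable[[Any], int]) -> Callable[[Any, Any], float]:
    def d(x: Any, z: Any) -> float:
        return 0.0 if h(x) == h(z) else 1.0
    return d

# usage: d = coarse_metric(my_classifier); d(x1, x2) is 0.0 or 1.0
```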

    Spectrum-Aware Adjustment: A New Debiasing Framework with Applications to Principal Components Regression

    We introduce a new debiasing framework for high-dimensional linear regression that bypasses the restrictions on covariate distributions imposed by modern debiasing technology. We study the prevalent setting where the number of features and samples are both large and comparable. In this context, state-of-the-art debiasing technology uses a degrees-of-freedom correction to remove the shrinkage bias of regularized estimators and conduct inference. However, this method requires that the observed samples are i.i.d., that the covariates follow a mean-zero Gaussian distribution, and that reliable covariance matrix estimates for the observed features are available. This approach struggles when (i) covariates are non-Gaussian with heavy tails or asymmetric distributions, (ii) rows of the design exhibit heterogeneity or dependencies, and (iii) reliable feature covariance estimates are lacking. To address these issues, we develop a new strategy where the debiasing correction is a rescaled gradient descent step (suitably initialized) with step size determined by the spectrum of the sample covariance matrix. Unlike prior work, we assume that the eigenvectors of this matrix are uniform draws from the orthogonal group. We show this assumption remains valid in diverse situations where traditional debiasing fails, including designs with complex row-column dependencies, heavy tails, asymmetric properties, and latent low-rank structures. We establish asymptotic normality of our proposed estimator (centered and scaled) under various convergence notions. Moreover, we develop a consistent estimator for its asymptotic variance. Lastly, we introduce a debiased Principal Components Regression (PCR) technique using our Spectrum-Aware approach. In varied simulations and real-data experiments, we observe that our method outperforms degrees-of-freedom debiasing by a margin.
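    The abstract describes the correction as a rescaled gradient descent step whose step size is set by the spectrum of the sample covariance matrix. The sketch below only illustrates the generic shape of such a one-step correction; the step size shown is a placeholder, not the paper's Spectrum-Aware choice, and all names are illustrative.

```python
# Hypothetical sketch of a one-step gradient debiasing correction applied to a
# regularized estimate beta_hat. The scalar `eta` stands in for a step size
# derived from the spectrum of the sample covariance X.T @ X / n; the default
# below is a placeholder, NOT the Spectrum-Aware formula from the paper.
import numpy as np

def one_step_debias(X, y, beta_hat, eta=None):
    n, _ = X.shape
    if eta is None:
        # placeholder: reciprocal of the average eigenvalue of the sample covariance
        eta = 1.0 / np.mean(np.linalg.eigvalsh(X.T @ X / n))
    # negative gradient of the squared loss 0.5 * ||y - X beta||^2 / n at beta_hat
    direction = X.T @ (y - X @ beta_hat) / n
    return beta_hat + eta * direction
```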

    A modern maximum-likelihood theory for high-dimensional logistic regression

    Every student in statistics or data science learns early on that when the sample size largely exceeds the number of variables, fitting a logistic model produces estimates that are approximately unbiased. Every student also learns that there are formulas to predict the variability of these estimates which are used for the purpose of statistical inference; for instance, to produce p-values for testing the significance of regression coefficients. Although these formulas come from large-sample asymptotics, we are often told that we are on reasonably safe grounds when $n$ is large in such a way that $n \ge 5p$ or $n \ge 10p$. This paper shows that this is far from the case, and consequently, inferences routinely produced by common software packages are often unreliable. Consider a logistic model with independent features in which $n$ and $p$ become increasingly large in a fixed ratio. Then we show that (1) the MLE is biased, (2) the variability of the MLE is far greater than classically predicted, and (3) the commonly used likelihood-ratio test (LRT) is not distributed as a chi-square. The bias of the MLE is extremely problematic as it yields completely wrong predictions for the probability of a case based on observed values of the covariates. We develop a new theory, which asymptotically predicts (1) the bias of the MLE, (2) the variability of the MLE, and (3) the distribution of the LRT. We also demonstrate empirically that these predictions are extremely accurate in finite samples. Further, an appealing feature is that these novel predictions depend on the unknown sequence of regression coefficients only through a single scalar, the overall strength of the signal. This suggests very concrete procedures to adjust inference; we describe one such procedure learning a single parameter from data and producing accurate inference.
    Comment: 29 pages, 14 figures, 4 tables
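    As a quick illustration of the phenomenon (not one of the paper's experiments), the sketch below simulates a logistic model with $p/n = 0.2$, i.e. $n = 5p$, and independent Gaussian features; the problem sizes and signal strength are arbitrary illustrative choices. The fitted coefficients systematically overshoot the truth even though the rule of thumb $n \ge 5p$ is met.

```python
# Simulation sketch (illustrative sizes/signal, not the paper's experiments):
# the logistic MLE with p/n fixed at 0.2 tends to overshoot the true coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 2000, 400                       # n = 5p, the classical "safe" rule of thumb
beta = np.full(p, np.sqrt(5.0 / p))    # signal strength Var(x' beta) = 5 (arbitrary)
X = rng.standard_normal((n, p))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta))).astype(float)

mle = sm.Logit(y, X).fit(disp=0)       # no intercept, matching the simulated model
# a ratio noticeably above 1 reflects the upward bias discussed above
print("average of beta_hat_j / beta_j:", np.mean(mle.params / beta))
```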

    The Asymptotic Distribution of the MLE in High-dimensional Logistic Models: Arbitrary Covariance

    We study the distribution of the maximum likelihood estimate (MLE) in high-dimensional logistic models, extending the recent results from Sur (2019) to the case where the Gaussian covariates may have an arbitrary covariance structure. We prove that in the limit of large problems, holding the ratio between the number $p$ of covariates and the sample size $n$ constant, every finite list of MLE coordinates follows a multivariate normal distribution. Concretely, the $j$th coordinate $\hat{\beta}_j$ of the MLE is asymptotically normally distributed with mean $\alpha_\star \beta_j$ and standard deviation $\sigma_\star/\tau_j$; here, $\beta_j$ is the value of the true regression coefficient, and $\tau_j$ the standard deviation of the $j$th predictor conditional on all the others. The numerical parameters $\alpha_\star > 1$ and $\sigma_\star$ only depend upon the problem dimensionality $p/n$ and the overall signal strength, and can be accurately estimated. Our results imply that the MLE's magnitude is biased upwards and that the MLE's standard deviation is greater than that predicted by classical theory. We present a series of experiments on simulated and real data showing excellent agreement with the theory.
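    In display form, the coordinate-wise statement above reads as follows (restating the abstract, with the same symbols):

```latex
% Coordinate-wise asymptotic normality of the MLE, as stated in the abstract:
% \hat{\beta}_j is approximately Gaussian, centered at an inflated multiple of
% the true coefficient.
\[
  \tau_j \bigl( \hat{\beta}_j - \alpha_\star \beta_j \bigr)
    \;\xrightarrow{d}\;
  \mathcal{N}\bigl( 0, \sigma_\star^2 \bigr),
  \qquad \alpha_\star > 1,
\]
% where \tau_j is the standard deviation of the j-th predictor conditional on
% all the others, and (\alpha_\star, \sigma_\star) depend only on p/n and the
% overall signal strength.
```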